Co-Design Techniques for Fault-Tolerant Real-Time Systems using Imperfect Fault Detectors
Abstract
To meet the reliability requirements of safety-critical embedded systems, fault tolerance techniques such as active redundancy are widely adopted. Designing a fault-tolerant system with active redundancy is challenging because it involves two major problems: finding the optimal use of temporal and/or spatial redundancy, and scheduling the tasks (including replicas) under timing constraints. Considerable research effort has been devoted to this field over the past decades. To cope with the high problem complexity, many state-of-the-art studies make simplifying assumptions about the fault models and modes. Perfect fail-silent behavior is one assumption that is often used in the literature. It is assumed that all faults are detected within a certain time interval and that the fault detection overhead is contained in the tasks' Worst-Case Execution Times (WCETs), e.g., in fault-tolerant task scheduling [1, 2, 3, 4, 5], in reliability-aware energy management [6, 7, 8] and in error-aware system design [9, 10]. Under this assumption, each task produces either a correct output or no output at all. Although fail-silence is a highly desirable property, it is difficult to implement in practice: it requires a perfect fault detector that achieves 100% coverage under the given fault hypothesis. In the previous study [11], we explained the major problems of this assumption. On the one hand, the assumption is very impractical; on the other hand, even if it can be implemented, perfect fault detection is often a suboptimal design decision, because good fault detectors usually come with high timing overheads [12, 13]. When active redundancy is used, there is in fact a tradeoff between spending the available resources on better fault detection and spending them on more redundancy. We have developed new analysis and optimization techniques to tackle these issues.
Experimental results show that certain designs involving imperfect fault detectors combined with task replication can outperform other designs assuming perfect fault detection. So far, only software-implemented fault detection has been considered. However, as shown in [10], fault detection could also be implemented in hardware to reduce the time overhead, e.g., using on-chip reconfigurable FPGA fabric. This not only helps reduce the schedule length but also allows more options for redundancy. Unfortunately, hardware fault detection increases the overall system cost. In particular, the on-chip resources are often not sufficient to implement hardware fault detectors for all tasks. Hence, selecting which fault detector to implement for each task, and where to implement it, is a major design decision. In Figure 1 we show an example scenario extended from the motivating example of [11]. Figure 1a depicts the schedule using the perfect detector (it is assumed that perfect fault detection incurs 300% timing overhead). Figure 1b is another possible schedule, in which the task is replicated twice and the remaining time (200% of the task execution time in this case) is used to implement two partial fault detectors. Figures 1c and 1d show two similar schedules with a higher number of replicas. Figure 2 compares the reliability of those schedules in terms of the probability of detectable (DUF in the figure) and undetectable (SDC in the figure) faults. As can be seen, the design with perfect fault detection can detect all faults. However, merely detecting the faults is often not sufficient, e.g., for fail-operational applications. When multiple replicas of the same task are available, we have another means of fault detection, namely comparing the outputs of the different instances (voting). Certain faults might even be corrected, e.g., a single faulty input out of three inputs will be masked by voting.
In this case, the schedule with partial fault detectors might have higher reliability (see [11]). Figure 1e depicts one schedule that has not been considered so far. In this schedule, the fault detector is implemented in hardware to reduce the timing overhead. This allows us to schedule two instances of the task, both
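The tradeoff between detector coverage and the number of replicas can be illustrated with a small probabilistic sketch. This is not the analysis from [11]; it is a simplified model under assumptions introduced here for illustration only: each replica is hit by a fault independently with probability `p_fault`, each replica's detector silences a faulty output with probability `coverage`, all undetected faulty outputs are pessimistically assumed to agree with each other, and a voter outputs the value held by a strict majority of the non-silent replicas. The function name `outcome_probabilities` and its parameters are hypothetical.

```python
from math import factorial


def outcome_probabilities(replicas, p_fault, coverage):
    """Return the probabilities of a correct result, a detected failure (DUF)
    and a silent data corruption (SDC) for a replicated task.

    Assumed model (illustrative, not taken from the paper):
      - each replica is faulty independently with probability p_fault;
      - a faulty replica is detected and silenced with probability coverage;
      - undetected faulty outputs all carry the same wrong value (worst case);
      - the voter needs a strict majority among the non-silent outputs,
        otherwise it reports a detected failure.
    """
    p_ok = 1.0 - p_fault                   # replica delivers a correct output
    p_silent = p_fault * coverage          # fault detected, output suppressed
    p_wrong = p_fault * (1.0 - coverage)   # fault escapes the detector

    result = {"correct": 0.0, "DUF": 0.0, "SDC": 0.0}
    for n_ok in range(replicas + 1):
        for n_wrong in range(replicas - n_ok + 1):
            n_silent = replicas - n_ok - n_wrong
            # Multinomial coefficient: ways to assign the three states.
            ways = factorial(replicas) // (
                factorial(n_ok) * factorial(n_wrong) * factorial(n_silent)
            )
            prob = ways * p_ok**n_ok * p_wrong**n_wrong * p_silent**n_silent
            if n_ok > n_wrong:
                result["correct"] += prob   # correct value wins the vote
            elif n_wrong > n_ok:
                result["SDC"] += prob       # wrong value wins the vote
            else:
                result["DUF"] += prob       # tie or no output: failure is detected
    return result


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the model.
    for label, n, c in [("1 replica, perfect detector", 1, 1.0),
                        ("2 replicas, partial detectors", 2, 0.8),
                        ("3 replicas, no detectors", 3, 0.0)]:
        print(label, outcome_probabilities(n, p_fault=0.1, coverage=c))
```

In this model a single replica with a perfect detector never produces an SDC but fails detectably whenever a fault occurs, whereas three replicas without detectors mask any single fault by voting. Which configuration is preferable depends on the fault probability, the achievable coverage, and the timing overhead of each detector, none of which are captured by the sketch.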
Related resources
Co-design of Fault-Tolerant Systems with Imperfect Fault Detection
In recent decades, transient faults have become a critical issue in modern electronic devices. Therefore, many fault-tolerant techniques have been proposed to increase system reliability, such as active redundancy, which can be implemented in both space and time dimensions. The main challenge of active redundancy is to introduce the minimal overhead of redundancy and to schedule the tasks. In m...
On the Impact of Fast Failure Detectors on Real-Time Fault-Tolerant Systems
We investigate whether fast failure detectors can be useful — and if so by how much — in the design of real-time fault-tolerant systems. Specifically, we show how fast failure detectors can speed up consensus and fault-tolerant broadcasts, by providing fast algorithms and deriving some matching lower bounds, for synchronous systems with crashes. These results show that a fast failure detector s...
An approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code combining techniques are very effective, and turbo codes are one such class of codes. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...
Novel Defect Terminolgy Beside Evaluation And Design Fault Tolerant Logic Gates In Quantum-Dot Cellular Automata
Quantum-dot Cellular Automata (QCA) is one of the important nano-level technologies for the implementation of both combinational and sequential systems. QCA has the potential to achieve low power dissipation and to operate at high speed, at THz frequencies. However, the high probability of fabrication defects in QCA is a fundamental challenge to the use of this emerging technology. Because of these vari...
A framework for reliability-aware design exploration on MPSoC based systems
Applying system-level fault-tolerant techniques such as active redundancy is a promising way to enhance the system reliability for safety-related applications. Embedded system design using active redundancy is a challenging task that involves solving two major problems, namely finding the optimal redundancy configuration and mapping/scheduling of the application (including the redundant compone...